{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Project Collaboration and Competition - Report"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DDPG and MADDPG Algorithms "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this project, we use the **DDPG** algorithm (Deep Deterministic Policy Gradient) and the **MADDPG** algorithm,     \n",
    "a wrapper for DDPG. MADDPG stands for **Multi-Agent DDPG**. DDPG is an algorithm which concurrently   learns    \n",
    "a Q-function and a policy.  It uses off-policy data and the Bellman equation to learn the Q-function, \n",
    "and uses    \n",
    "the Q-function to learn the policy. This dual mechanism is the  actor-critic method. The DDPG algorithm uses   \n",
    "two additional mechanisms: _Replay Buffer_ and _Soft Updates_.  \n",
    "\n",
    "In MADDPG, we train two separate agents, and the agents need to **collaborate** (like don’t let the   ball hit the ground)   \n",
    "and **compete** (like gather as many points as possible). Just doing a simple extension of single \n",
    "agent RL    \n",
    "by independently training the two agents does not work very well because the agents are independently updating    \n",
    "their policies as learning progresses. And this causes the   environment to appear non-stationary from the viewpoint   \n",
    "of any one agent. \n",
    "\n",
    "In MADDPG, _each agent’s critic is trained using the observations and actions_ from **both agents** , whereas   \n",
    "each _agent’s actor is trained using just_ its **own observations**.  \n",
    "\n",
    "In the finction _step()_ of the _class madppg_\\__agent_, we collect all current info\n",
    " for **both agents**  into  the **common** variable    \n",
    "_memory_ of the type  _ReplayBuffer_.  Then we get the random _sample_ from _memory_  into the variable _experiance_.   \n",
    "This _experiance_   together with the current number of agent (0 or 1) go to the function _learn()_.   We get the corresponding    \n",
    "agent (of type _ddpg_\\__agent_):\n",
    "\n",
    "      agent = self.agents[agent_number]\n",
    "\n",
    "and _experiance_ is transferred to function _learn()_  of the _class ddpg_\\__agent_.  There, the actor and the critic \n",
    "are handled by different ways.  \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### Eight Neural Networks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this prohect, there are 8 _neural networks_.  For the training, we \n",
    "create one _maddpg agent_. \n",
    "\n",
    "         maddpg = maddpg_agent()\n",
    "         \n",
    "In turn, _maddpg agent_ creates 2 _ddpg agents_: \n",
    "         \n",
    "         self.agents = [ddpg_agent(state_size, action_size, i+1, random_seed=0) \n",
    "                  for i in range(num_agents)]    \n",
    "\n",
    "Each of two agents (red and blue) create 4 neural networks:\n",
    "\n",
    "        self.actor_local = Actor(state_size, action_size).to(device)\n",
    "        self.actor_target = Actor(state_size, action_size).to(device)\n",
    "\n",
    "        self.critic_local = Critic(state_size, action_size).to(device)\n",
    "        self.critic_target = Critic(state_size, action_size).to(device)\n",
    "\n",
    "Classes Actor and Critic are provided by **model.py**. The typical behavior of the actor \n",
    "\n",
    "        actor_target(state) -> next_actions\n",
    "        actor_local(states) -> actions_pred\n",
    "        \n",
    "see function _learn()_ in maddpg agent. The typical behavior of the critic is as follows:\n",
    "\n",
    "        critic_target(state, action) -> Q-value \n",
    "        -critic_local(states, actions_pred) -> actor_loss\n",
    "        \n",
    "see function _learn()_ in ddpg agent.        \n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Architecture of the actor and critic networks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both the actor and critic classes implement the neural network\n",
    "with 3 fully-connected layers and 2 rectified nonlinear layers. These networks are realized in the framework\n",
    "of package PyTorch. Such a network is used in Udacity model.py code for the Pendulum model using DDPG.\n",
    "The number of neurons of the fully-connected layers are as follows:\n",
    "\n",
    "for the actor:   \n",
    "Layer fc1, number of neurons: state_size x fc1_units,   \n",
    "Layer fc2, number of neurons: fc1_units x fc2_units,   \n",
    "Layer fc3, number of neurons: fc2_units x action_size,   \n",
    "\n",
    "for the critic:   \n",
    "Layer fcs1, number of neurons: (state_size + action_size) x n_agents x fcs1_units,   \n",
    "Layer fc2, number of neurons: (fcs1_units x fc2_units,   \n",
    "Layer fc3, number of neurons: fc2_units x 1.   \n",
    "\n",
    "Here, state_size = 24, action_size = 2.       \n",
    "The input parameters fc1_units, fc2_units, fcs1_units are all taken = 64.   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Hyperparameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From **ddpg_agent.py** \n",
    "\n",
    "        GAMMA = 0.99                    # discount factor  \n",
    "        TAU = 5e-2                      # for soft update of target parameters   \n",
    "        LR_ACTOR = 5e-4                 # learning rate of the actor   \n",
    "        LR_CRITIC = 5e-4                # learning rate of the critic  \n",
    "        WEIGHT_DECAY = 0.0              # L2 weight decay   \n",
    "        NOISE_AMPLIFICATION = 1         # exploration noise amplification  \n",
    "        NOISE_AMPLIFICATION_DECAY = 1   # noise amplification decay\n",
    "\n",
    "From **maddpg_agent.py**\n",
    "\n",
    "        BUFFER_SIZE = int(1e6)          # replay buffer size   \n",
    "        BATCH_SIZE = 512                # minibatch size   \n",
    "        LEARNING_PERIOD = 2             # weight update frequency \n",
    "        \n",
    "Note that parameters LEARNING_PERIOD is important. The corresponding code is in the function   _step()_.\n",
    "\n",
    "     if len(self.memory) > BATCH_SIZE and timestep % LEARNING_PERIOD == 0: \n",
    "         for a_i, agent in enumerate(self.agents):\n",
    "              experiences = self.memory.sample()\n",
    "              self.learn(experiences, a_i)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training the Agent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On my local machine with GPU, the desired average reward **+0.5** was achieved in **1302** episodes in **18** minutes.\n",
    "However, the training continued until **1700** episodes, where average reward was achived **+1.42** in **1 hour 18 minutes**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Full log"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Episode: 20, Score: -0.0050, \tAverage Score: -0.0050, Time: 00:00:00   \n",
    "Episode: 40, Score: -0.0050, \tAverage Score: -0.0050, Time: 00:00:02   \n",
    "Episode: 60, Score: -0.0050, \tAverage Score: -0.0050, Time: 00:00:07   \n",
    "Episode: 80, Score: -0.0050, \tAverage Score: -0.0050, Time: 00:00:12   \n",
    "Episode: 100, Score: -0.0050, \tAverage Score: -0.0050, Time: 00:00:18    \n",
    "Episode: 120, Score: -0.0050, \tAverage Score: -0.0040, Time: 00:00:23    \n",
    "Episode: 140, Score: -0.0050, \tAverage Score: -0.0025, Time: 00:00:30   \n",
    "Episode: 160, Score: -0.0050, \tAverage Score: -0.0020, Time: 00:00:36   \n",
    "Episode: 180, Score: -0.0050, \tAverage Score: -0.0015, Time: 00:00:43    \n",
    "Episode: 200, Score: 0.0450, \tAverage Score: 0.0040, Time: 00:00:53   \n",
    "Episode: 220, Score: -0.0050, \tAverage Score: 0.0070, Time: 00:01:03   \n",
    "Episode: 240, Score: 0.0450, \tAverage Score: 0.0065, Time: 00:01:09   \n",
    "Episode: 260, Score: -0.0050, \tAverage Score: 0.0085, Time: 00:01:17   \n",
    "Episode: 280, Score: -0.0050, \tAverage Score: 0.0080, Time: 00:01:22    \n",
    "Episode: 300, Score: -0.0050, \tAverage Score: 0.0045, Time: 00:01:30    \n",
    "Episode: 320, Score: -0.0050, \tAverage Score: 0.0035, Time: 00:01:38    \n",
    "Episode: 340, Score: 0.0450, \tAverage Score: 0.0045, Time: 00:01:46    \n",
    "Episode: 360, Score: -0.0050, \tAverage Score: 0.0030, Time: 00:01:52    \n",
    "Episode: 380, Score: -0.0050, \tAverage Score: 0.0030, Time: 00:01:58    \n",
    "Episode: 400, Score: -0.0050, \tAverage Score: 0.0010, Time: 00:02:04    \n",
    "Episode: 420, Score: -0.0050, \tAverage Score: -0.0005, Time: 00:02:11    \n",
    "Episode: 440, Score: -0.0050, \tAverage Score: 0.0010, Time: 00:02:20    \n",
    "Episode: 460, Score: -0.0050, \tAverage Score: 0.0015, Time: 00:02:27    \n",
    "Episode: 480, Score: -0.0050, \tAverage Score: 0.0035, Time: 00:02:35    \n",
    "Episode: 500, Score: -0.0050, \tAverage Score: 0.0085, Time: 00:02:43    \n",
    "Episode: 520, Score: -0.0050, \tAverage Score: 0.0075, Time: 00:02:49    \n",
    "Episode: 540, Score: -0.0050, \tAverage Score: 0.0065, Time: 00:02:57    \n",
    "Episode: 560, Score: -0.0050, \tAverage Score: 0.0050, Time: 00:03:03   \n",
    "Episode: 580, Score: -0.0050, \tAverage Score: 0.0050, Time: 00:03:10    \n",
    "Episode: 600, Score: -0.0050, \tAverage Score: 0.0030, Time: 00:03:17    \n",
    "Episode: 620, Score: -0.0050, \tAverage Score: 0.0050, Time: 00:03:25    \n",
    "Episode: 640, Score: -0.0050, \tAverage Score: 0.0110, Time: 00:03:38    \n",
    "Episode: 660, Score: 0.0950, \tAverage Score: 0.0130, Time: 00:03:46    \n",
    "Episode: 680, Score: -0.0050, \tAverage Score: 0.0210, Time: 00:03:59    \n",
    "Episode: 700, Score: -0.0050, \tAverage Score: 0.0185, Time: 00:04:05    \n",
    "Episode: 720, Score: -0.0050, \tAverage Score: 0.0160, Time: 00:04:10    \n",
    "Episode: 740, Score: -0.0050, \tAverage Score: 0.0075, Time: 00:04:16    \n",
    "Episode: 760, Score: -0.0050, \tAverage Score: 0.0065, Time: 00:04:23    \n",
    "Episode: 780, Score: -0.0050, \tAverage Score: 0.0045, Time: 00:04:34    \n",
    "Episode: 800, Score: -0.0050, \tAverage Score: 0.0080, Time: 00:04:43    \n",
    "Episode: 820, Score: 0.0450, \tAverage Score: 0.0110, Time: 00:04:51   \n",
    "Episode: 840, Score: 0.1450, \tAverage Score: 0.0180, Time: 00:05:07   \n",
    "Episode: 860, Score: 0.0450, \tAverage Score: 0.0305, Time: 00:05:28  \n",
    "Episode: 880, Score: 0.0450, \tAverage Score: 0.0315, Time: 00:05:42   \n",
    "Episode: 900, Score: 0.0450, \tAverage Score: 0.0415, Time: 00:06:00    \n",
    "Episode: 920, Score: 0.0450, \tAverage Score: 0.0510, Time: 00:06:17   \n",
    "Episode: 940, Score: 0.0450, \tAverage Score: 0.0545, Time: 00:06:34   \n",
    "Episode: 960, Score: 0.1450, \tAverage Score: 0.0535, Time: 00:06:51   \n",
    "Episode: 980, Score: 0.0450, \tAverage Score: 0.0610, Time: 00:07:12   \n",
    "Episode: 1000, Score: 0.0450, \tAverage Score: 0.0615, Time: 00:07:28  \n",
    "Episode: 1020, Score: -0.0050, \tAverage Score: 0.0550, Time: 00:07:38  \n",
    "Episode: 1040, Score: 0.1450, \tAverage Score: 0.0510, Time: 00:07:48   \n",
    "Episode: 1060, Score: 0.0450, \tAverage Score: 0.0570, Time: 00:08:08  \n",
    "Episode: 1080, Score: 0.0450, \tAverage Score: 0.0565, Time: 00:08:25  \n",
    "Episode: 1100, Score: 0.0450, \tAverage Score: 0.0670, Time: 00:08:49  \n",
    "Episode: 1120, Score: 0.1450, \tAverage Score: 0.0865, Time: 00:09:15  \n",
    "Episode: 1140, Score: 0.0450, \tAverage Score: 0.1115, Time: 00:09:46  \n",
    "Episode: 1160, Score: 0.0450, \tAverage Score: 0.1170, Time: 00:10:11  \n",
    "Episode: 1180, Score: 0.2950, \tAverage Score: 0.1250, Time: 00:10:36  \n",
    "Episode: 1200, Score: -0.0050, \tAverage Score: 0.1160, Time: 00:10:53  \n",
    "Episode: 1220, Score: 2.3450, \tAverage Score: 0.1350, Time: 00:11:39  \n",
    "Episode: 1240, Score: 0.0450, \tAverage Score: 0.1660, Time: 00:12:38  \n",
    "Episode: 1260, Score: 0.1950, \tAverage Score: 0.2350, Time: 00:14:01  \n",
    "Episode: 1280, Score: 0.0450, \tAverage Score: 0.3336, Time: 00:15:56  \n",
    "Episode: 1300, Score: 0.0450, \tAverage Score: 0.4592, Time: 00:18:07  \n",
    " \n",
    "*** Environment solved in 1302 episodes!\tAverage Score: 0.51 ***   \n",
    "\n",
    "Episode: 1320, Score: 0.0450, \tAverage Score: 0.5434, Time: 00:20:07   \n",
    "Episode: 1340, Score: 0.0450, \tAverage Score: 0.4999, Time: 00:20:30   \n",
    "Episode: 1360, Score: 0.0450, \tAverage Score: 0.5736, Time: 00:22:51   \n",
    "Episode: 1380, Score: 0.0450, \tAverage Score: 0.4900, Time: 00:23:33   \n",
    "Episode: 1400, Score: 0.0450, \tAverage Score: 0.3955, Time: 00:24:17   \n",
    "Episode: 1420, Score: 0.0450, \tAverage Score: 0.3785, Time: 00:25:51   \n",
    "Episode: 1440, Score: 2.6000, \tAverage Score: 0.6059, Time: 00:29:11   \n",
    "*** Episode 1450\tAverage Score: 0.55, Time: 00:29:18 ***   \n",
    " \n",
    "Episode: 1460, Score: 1.8950, \tAverage Score: 0.5722, Time: 00:31:11   \n",
    "Episode: 1480, Score: 0.0450, \tAverage Score: 0.6788, Time: 00:33:26   \n",
    "Episode: 1500, Score: 0.7950, \tAverage Score: 0.9267, Time: 00:37:45   \n",
    "*** Episode 1500\tAverage Score: 0.93, Time: 00:37:45 ***   \n",
    " \n",
    "Episode: 1520, Score: -0.0050, \tAverage Score: 0.9882, Time: 00:40:17   \n",
    "Episode: 1540, Score: 2.6000, \tAverage Score: 0.9486, Time: 00:43:17   \n",
    "*** Episode 1550\tAverage Score: 1.11, Time: 00:46:03 ***\n",
    " \n",
    "Episode: 1560, Score: 2.6500, \tAverage Score: 1.0685, Time: 00:47:36   \n",
    "Episode: 1580, Score: 2.6000, \tAverage Score: 1.1664, Time: 00:51:30   \n",
    "Episode: 1600, Score: 0.0450, \tAverage Score: 1.1409, Time: 00:55:57   \n",
    "*** Episode 1600\tAverage Score: 1.14, Time: 00:55:57 ***  \n",
    " \n",
    "Episode: 1620, Score: 1.4950, \tAverage Score: 1.1379, Time: 00:58:45   \n",
    "Episode: 1640, Score: 0.0950, \tAverage Score: 1.1464, Time: 01:02:11   \n",
    "*** Episode 1650\tAverage Score: 1.16, Time: 01:05:03 ***  \n",
    " \n",
    "Episode: 1660, Score: 0.0450, \tAverage Score: 1.2741, Time: 01:08:14   \n",
    "Episode: 1680, Score: 2.6000, \tAverage Score: 1.3547, Time: 01:13:11    \n",
    "Episode: 1700, Score: 0.0950, \tAverage Score: 1.4214, Time: 01:18:16    \n",
    "*** Episode 1700\tAverage Score: 1.42, Time: 01:18:16 ***  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Future ideas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Check different values for hyperparameters such as LEARNING_PERIOD, and neural network parameters fc1_units, fc2_units, etc.\n",
    "2. How does the addition of new nonlinear layers in the used neural networks affect the robustness of the algorithm.\n",
    "3. It would be interesting to train agents using [MAPPO](https://github.com/kotogasy/unity-ml-tennis) and compare them with MADDPG. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
